arXiv:1509.04567vl [math.NA] 15 Sep 2015 


A BLOCK KRYLOV SUBSPACE IMPLEMENTATION OF THE 
TIME-PARALLEL PARAEXP METHOD AND ITS EXTENSION FOR 
NONLINEAR PARTIAL DIFFERENTIAL EQUATIONS 

GIJS L. KOOIjt, MIKE A. BOTCHEV*, AND BERNARD J. GEURTSt§ 

Abstract. A parallel time integration method for nonlinear partial differential equations is 
proposed. It is based on a new implementation of the Paraexp method for linear partial differential 
equations (PDEs) employing a block Krylov subspace method. For nonlinear PDFs the algorithm 
is based on our Paraexp implementation within a waveform relaxation. The initial value problem 
is solved iteratively on a complete time interval. Nonlinear terms are treated as a source term, 
provided by the solution from the previous iteration. At each iteration, the problem is decoupled 
into independent subproblems by the principle of superposition. The decoupled subproblems are 
solved fast by exponential integration, based on a block Krylov method. The new time integration is 
demonstrated for the one-dimensional advection-diffusion equation and the viscous Burgers equation. 
Numerical experiments confirm excellent parallel scaling for the linear advection-diffusion problem, 
and good scaling in case the nonlinear Burgers equation is simulated. 

Key words, parallel computing, exponential integrators, partial differential equations, parallel 
in time, block Krylov subspace. 
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1. Introduction. Recent developments in hardware architecture, bringing to 
practice computers with hundreds of thousands of cores, urge the creation of new, 
as well as a major revision of existing numerical algorithms [14]. To be efficient on 
such massively parallel platforms, the algorithms need to employ all possible means 
to parallelize the computations. When solving partial differential equations (PDEs) 
with a time-dependent solution, an important way to parallelize the computations 
is, next to the parallelization across space, parallelization across time. This adds a 
new dimension of parallelism with which the simulations can be implemented. In 
this paper we present a new time-parallel integration method extending the Paraexp 
method [21] to nonlinear partial differential equations using Krylov methods and 
waveform relaxation. 

Several approaches to parallelize the simulation of time-dependent solutions in 
time can be distinguished. The first important class of the methods are the waveform 
relaxation methods [6, 38, 44, 56], including the space-time multigrid methods for 
parabolic PDEs [8, 24, 30, 37]. The key idea is to start with an approximation to 
the numerical solution for the whole time interval of interest and update the solution, 
solving an easier-to-solve approximate system in time. The Parareal method [36], 
which attracted significant attention recently, is a prime example related to the class 
of waveform relaxation methods [19]. 

Parallel Runge-Kutta methods and general linear methods, where the parallelism 
is determined and restricted by the number of stages or steps, form another class of 
the time-parallel methods [8, 12, 52]. Time-parallel schemes can also be obtained 
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by treating the time as an additional space dimension and solving for values at all 
time steps at once [13, 57]. This approach requires significantly more memory and 
is used, e.g., in multidimensional tensor computations and (discontinuous) Galerkin 
finite element methods [9, 31, 35, 54]. Recently a “parallel full approximation scheme 
in space and time” (PFASST) was introduced, which is based on multigrid meth¬ 
ods [17]. PFASST was observed to have a generally improved parallel efficiency com¬ 
pared to Parareal. Also, we mention parallel methods that facilitate parallelism by 
replacing the exact solves in implicit schemes by approximate ones [7], and parallel 
methods based on operator splitting [4]. Finally, there is the Paraexp method [21] 
for parallel integration of linear initial-value problems. This algorithm is based on 
the fact that linear initial-value problems can be decomposed into homogeneous and 
nonhomogeneous subproblems. The homogeneous subproblems can be solved fast by 
an exponential integrator, while the nonhomogeneous subproblems can be solved a 
traditional time-stepping method. 

In this paper we propose a time parallel method which is based on the matrix 
exponential time stepping technique and waveform relaxation. Our method adopts the 
Paraexp method within a waveform relaxation framework, which enables the extension 
of parallel time integration to nonlinear PDFs. The method is inspired by and relies 
on recent techniques in Krylov subspace matrix exponential methods such as shift- 
and-invert acceleration [22, 40, 51], restarting [1, 10, 16, 45, 50] and using block Krylov 
subspaces [2] to handle nonautonomous PDFs efficiently. The method also relies on 
a singular value decomposition (SVD) of source terms in the PDF, which is used to 
construct the block Krylov subspace. To improve the efficiency of the method, the 
SVD is truncated by retaining only the relatively large singular values. We show in 
theory and in practice that for a source term, that can be approximated by a smooth 
function, the singular values decay algebraically with time interval size. In that case, 
a truncated SVD approximation of the source term is adequate. 

The contribution of this paper is more specifically as follows. First, to solve sys¬ 
tems of linear ODFs, we propose an implementation of the Paraexp method, based 
on the exponential block Krylov (FBK) method, introduced in [2]. The FBK method 
is a Krylov method that approximates the exact solution of a system of nonhomo¬ 
geneous linear ordinary differential equations by a projection onto a block Krylov 
subspace. Our Paraexp implementation does not involve a splitting of homogeneous 
and nonhomogeneous subproblems, which leads to a better parallel effiency. Second, 
we extend our FBK-based implementation of the Paraexp method to nonlinear initial- 
value problems. We solve the problem iteratively on a chosen time interval. In our 
case, the nonlinear term of the PDF is incorporated as a source term, which is up¬ 
dated after every iteration with the latest solution. Third, we show that our Paraexp- 
FBK (PFBK) implementation has promising properties for time-parallelization. The 
PFBK method can be seen as a special continuous-time waveform relaxation method, 
which is typically more efficient than standard waveform relaxation methods [6]. The 
PFBK method is tested for the one-dimensional advection-diffusion equation and the 
viscous Burgers equation in an extensive series of numerical experiments. Since we are 
interested in time-parallel solvers for large-scale applications in computational fluid 
dynamics (CFD), such as turbulent flow simulations, we also test our method with 
respect to the diffusion/advection ratio and grid resolution, reflecting high Reynolds- 
number conditions. For example, the parallel efficiency of the Parareal algorithm was 
found to decrease considerably with increasing number of processors for the advection 
equation [20, 47, 48]. In contrast, the FBK method was found to have excellent weak 
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scaling for linear problems. 

The paper is organized as follows. In Section 2 we describe the exponential time 
integration scheme and its parallelization. In Section 3, we present numerical exper¬ 
iments with the advection-diffusion equation, and with the viscous Burgers equation 
in Section 4. Finally, a discussion and conclusions are outlined in Section 5. 

2. Exponential time-integration. Our time integration method for linear and 
nonlinear PDEs is based on the exponential block Krylov (EBK) method [2]. In this 
section, we first provide a brief description of the EBK method. Then, we extend 
the EBK method to integrate nonlinear PDEs in an iterative way. Finally, the paral¬ 
lelization of the time integrator is discussed. 

2.1. Exponential block Krylov method. The EBK method is a time inte¬ 
grator for linear systems of nonhomogeneous ordinary differential equations (ODEs). 
Details of the method are given in [2]. We follow the method of lines approach [55], 
i.e., the PDE is discretized in space first. We start with linear PDEs. After applying 
a spatial discretization to the PDE, we obtain the initial value problem, 

W{t)=Au{t)+g{t), 
u(0) = uo. 


where u(t) is a vector function u(t) : R —>■ R", A G is a square matrix, and 

g(t) G M" a time-dependent source term. The vector u can readily be identified 
with the solution to the PDE in terms of its values taken on a computational grid 
in physical space. In Section 2.3, we will introduce a more general term g(t,u(t)), 
which contains the nonlinear terms of the PDE. The matrix A is typically large and 
sparse, depending on the spatial discretization method. The dimension of the system, 
n, corresponds to the number of degrees of freedom used in the spatial discretiza¬ 
tion. Exponential integrators approximate the exact solution of the semi-discrete 
system (2.1), formulated in terms of the matrix exponential. 

The first step in the EBK method is to approximate g(t) with a function that 
can be treated easier. Here, g(t) is approximated based on a truncated singular value 
decomposition (SVD) [3] as follows. The source term is sampled at s points in time, 
0 = < t 2 < ... < ts-i < ts = AT, in the integration interval [0, AT], and the 

source samples form the matrix 

G=[g{h) g{h) ... g(t,)]GR">^^ (2.2) 

We make a natural assumption that the typical number of time samples s, necessary 
for a good approximation of g(t), is much lower than n: s n. Since we assume 
s <C n, it is more economical to calculate the so-called thin SVD of G [23], instead of 
the full SVD, without loss of accuracy. In this case, the thin SVD of G is 

G = (2.3) 

where E G is a diagonal matrix containing the singular values ci > (72 > ... > 

(Ts, and U G V G R®^® are matrices with orthonormal columns. The thin SVD 

can be approximated by a truncation in which the m < s largest singular values are 
retained. As seen from this truncated SVD, the samples of g{t) can be approximated 
by linear combinations of the first m columns of U, i.e.. 


g{t) fv Up{t). 


( 2 . 4 ) 
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where U G is formed by the first m columns of U, p{t) G K.™ is obtained 

by an interpolation of the coefficients in these linear combinations [3]. There are 
several possible choices for p(t), among which, cubic piecewise polynomials. Then, 
the approximation error in the source term, ||g(t) — 17p(t)||, can be easily controlled 
within a desired tolerance, depending on the number of samples in [0, AT], and the 
number of singular values truncated (see [2]). 

The number of retained singular values, m, is equal to the block width in the block 
Krylov subspace. The efficiency of the EBK method therefore depends on how many 
singular values are required for an accurate approximation of the source term (2.4). In 
applications of the method this parameter m can be varied and practical convergence 
can be assessed. If the subinterval AT is small and the source function g(t) can 
be well approximated by a smooth function, then the singular values of the samples 
of g decrease rapidly, thus allowing for an efficient low rank approximation in (2.4). 
Furthermore, let the subinterval length AT be small in the sense that (AT)^ <C AT (in 
dimensionless time units). Of course, the assumption that AT is small, is not always 
realistic and we comment on how it can be partially relaxed below in Remark 1. 
Denote by g^-l^ the jth derivative of g and assume that the constants 

M,- := max j = 0,... ,n, 

4e[t,t+AT] 

are bounded. We can then show that the singular values of the samples decrease, 
starting from a thin QR factorization of G. 

Consider the thin QR factorization of G, G = QR, where Q G has orthonor¬ 
mal columns and i? G is an upper triangular matrix with nonnegative diagonal 
entries. As Theorem 1 in [25] states, the entries of R satisfy 

|r,fc| <G,M,_i(AT)J-i, k>j, 

and Tjk = 0 for k < j because R is upper triangular. In this estimate the constants 
Cj depend only on the points , ..., G at which the samples are computed and do 
not depend on g and AT. 

Since G and R have the same singular values, we consider now the singular values 
of R. According to [29, 3.1.3], 

(Tj+I > WRjh, j = l,...,s-l, (2.5) 

where Rj is a matrix obtained from R by skipping j rows or columns, chosen arbitrar¬ 
ily. Since the entries in R decrease rowwise as \rjk\ = 0{ATy~^, to have the sharpest 
estimate in (2.5), we choose Rj to be the matrix R with the first j rows skipped. To 
bound the 2-norm of Rj we use [28, Chapter 5.6] 

||i?,||2 < ^||i?,|ll||i?,||oo 

and note that for j = 1,..., s — 1 

WRjWoo < {s - j)c,M,.yATy-\ IIT,111 = o{ATy-\ 

Thus, we obtain ||i?j ||2 = Vs — j 0{ATy~^, which, together with (2.5), yields the 
following result. 

Theorem 2.1. Let g(t) : R —>■ R" be a smooth function such that the constants 
Mj := max (t)|| j = 0,...,n, 

t6[0,AT] 




A TIME-PARALLEL EXPONENTIAL INTEGRATOR 


5 


are bounded. Furthermore, let the subinterval length AT be small in the sense that 
(AT)^ <C AT. Then for the singular values Gj, j = 1,..., s, of the sample matrix 

G=[g(ii) g(i2) ... g(G)]GK”^^ 

holds 

CTj+i = Vs - j 0{ATy, j = l,...,s-l. (2.6) 


Remark 1. We may extend the result of Theorem 2.1 to the case of a large 
AT if the constants Mj are bounded more strongly or even decay with j. Indeed, if 
AT is large, we can consider the function g(^) g(t -I- ^q), with q chosen such that 
T := AT/q is small. As the function g takes the same values for ^ G [0,t] as the 
function g for ^ G [t,t + AT], Theorem 2.1 formally holds for g with AT replaced 
by T and Mj multiplied by ql. The coefficients in the O symbol in (2.6) may now 
grow with the powers of AT, thus rendering the result meaningless unless a stricter 
assumption on Mj is made. 

The truncated SVD approximation of the source term (2.4) facilitates the solution 
of the initial value problem (IVP) in (2.1) by the block Krylov subspace method [2]. 
Here, the block Krylov subspace at iteration I is defined as 

lCi{A,U) :=span{U,AU,A^U,...,A'‘-^U] . (2.7) 

Furthermore, the block Krylov subspace method can be accelerated by the shift-and- 
invert technique [22, 40, 51] and, to keep the Krylov subspace dimension bounded, 
implemented with restarting [5, 10]. 

2.2. Parallelization of linear problems. In this section, the parallelization 
of the linear ODE solution is discussed. As we will see, the method is equivalent to 
the Paraexp method, but there are differences in implementation leading to a better 
parallel efficiency (see Section 3.2). Linear IVPs can be solved parallel in time by 
using the principle of superposition. First, the time intervals is divided into P non¬ 
overlapping subintervals, 


0 = To < Ti < ... < Tp = T, 


where P is also the number of processors that can be used in parallel. We introduce 
the shift of u{t) with respect to the initial condition, u(t) = Uq + u(t). The shifted 
variable u{t) solves the IVP with homogeneous initial conditions. 


u'(t) = Au(t)+g(t), tG[0,T], 

u(0) = 0, 


( 2 . 8 ) 


where 


g(f) := Auo + g{t). 


(2.9) 


The source term is approximated using s time samples per subinterval, as illustrated 
in Fig. 2.1. The shifted IVP (2.8) can be decoupled into independent subproblems by 
the principle of superposition. To each subinterval [Tj,Tj-i\ we associate a part of 
the source term g(t), defined for j = 1, ..., T as 


f S{t), for Tj_i <t< Tj, 
\ 0, otherwise. 


( 2 . 10 ) 
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Tj-1 Tj Tj+i Tj + i 


-1-^^^^ - 

to t\ ■ * ■ tg — \ tg 

Figure 2.1: Time samples of the source term on the subinterval [Tj,Tj+i]. 


The expected parallel speedup is based on the observation that the solution to (2.8) 
is given by a variation-of-constant formula (see, e.g., [26]), 

/•T E 


u(T) = / exp((T-s)A)g(s)(is = ^ / exp ((T - s)A) gj(s) ds, (2.11) 
j Q A _ -y J Tj — \ 


where the integrals ^ . .ds can be evaluated independently and in parallel. More 
precisely, by exploiting the linearity of the problem we decompose the IVP (2.20) into 
P independent subproblems, 


v'(t) = Avj(t)+ gj{t), t e [o,r], 
v,(0) = o. 


( 2 . 12 ) 


where Vj (t) is referred to as a subsolution of the total problem. The subproblems are 
independent and can be solved in parallel. We integrate the subproblem individually 
with the EBK method. Observe that the source is only nonzero on the subinter¬ 
val [Tj_i,Tj). We follow the ideas of the Paraexp method [21] and note that the 
nonhomogeneous part of the ODE requires most of the computational work in the 
EBK method. An accurate SVD approximation of the source term (2.4) generally 
requires more singular values to be retained, increasing the dimensions of the block 
Krylov subspace (2.7). According to the principle of superposition, the solution of 
the original problem is then given as 


p 

u(t) = uo+^Vj(t). (2.13) 

1=1 

The summation of the subsolutions, Vj(t) is the only communication required between 
the parallel processes. The parallel algorithm is summarized in Fig. 2.2. In principle, 
this algorithm is identical to the Paraexp method [21]. The only practical difference 
is that in our implementation both the nonhomogeneous and the homogeneous part 
of the subproblems are solved by the EBK method. The original version of the 
Paraexp method assumes a convential time integration method to solve the “difficult” 
nonhomogeneous part, and a Krylov subspace method for exponential integration of 
the homogeneous part. Our implementation of the Paraexp method is compared 
numerically with the original one in Section 3.2. 

Finally, additional parallelism could be exploited within the block Krylov sub¬ 
space method itself. If the block size is m, then the m matrix-vector products can be 
executed entirely in parallel, see [34, 46]. This approach could be applied in combi¬ 
nation with the parallelization described in this section. 
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Algorithm parallel time integration. 

Given: A, Uq, g{t), ... 

Solve: u'{t) = Au(<) -P g{t), u(0) = Uq. 

for j = 1,... ,P ( in parallel) 

Solve nonhomogeneous part of the ODE (2.12), for t S [Tj-i,Tj]. 
Solve homogeneous part of the ODE (2.12), for t G [T^^Tp]. 

end for 

Construct solution u(f), see (2.13). 


Figure 2.2: The algorithm for solving linear ODE systems with the Paraexp exponen¬ 
tial block Krylov (PEEK) method. 


2.3. Treatment of nonlinearities. The EBK method, designed to solve a lin¬ 
ear system of nonhomogeneous ODEs (2.1), can be extended to handle nonlinear 
systems of ODEs by including the nonlinearities in the source term. The system is 
then solved iteratively, in such a way that the iterand Uk(t) is updated on the entire 
interval [0,T]: 

Ufe(t) i-A Ufc+i(t), (2.14) 

where k denotes the iteration index. We proceed in a few steps. First, the problem 
is reduced to the form of Eq. (2.1). The nonlinear IVP is therefore approximated 
by evaluating the source term with the current iterand, Uk{t). Because the current 
iterand (t) is a known function, we can write the source term as an explicit function 
of time, 


gk{t) := g{t,Uk{t)). 


(2.15) 


The resulting initial value problem is. 


Ufc+i(t) = Auk+i(t) + gk(t), 

Ufc+i(0) = Uq. 


(2.16) 


This system is solved by the EBK method to achieve one iteration. This process 
is continued until the solution is sufficiently converged. This approach is similar 
to applying Picard or fixed-point iterations [42] on the nonlinear term. The source 
term can be further improved for nonlinear behaviour by using the Jacobian matrix, 
similar to a Newton-Rhapson method. In this case, we introduce the average Jacobian 
matrix. 


Jk ■ — 


r 

'dgk 

dgk 

Jo 

dui 

dun_ 


(2.17) 


which is averaged over the interval of integration [0, T]. Generally, the solution varies 
in time and so does the Jacobian matrix. The nonlinear remainder, with respect to 
the time-averaged state, is contained in the source term. Using this correction, we 
then find the recursive relation. 


Ufc+i(t) = [A + Jk]uk+i{t) -L gk{t) - JfcUfe(t), 

Ufe+i(0) = Uq. 


(2.18) 
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When converged, Ufe+i(i) = Uk{t), the terms containing Jk eventually disappear. 

Remark 2. The iteration (2.18) is well known in the literature on waveform 
relaxation methods [37, 38]. Its convergence is given, e.g., in [5, 42] and, in case of 
an inexact iteration, in [6]. These results, in particular, show that the iteration (2.18) 
converges superlinearly on finite time intervals if the norm of the matrix exponential 
oft{A + Jk) decays exponentially with t, i.e., the eigenvalues of A + Jk lie in the left 
half-plane. 

2.4. Parallelization of nonlinear problems. Nonlinear IVPs are solved in an 
iterative way with the PEEK method. We follow the waveform relaxation approach 
[6, 33, 42], that is, the problem is solved iteratively on the entire time interval of 
interest, [0,T]. The original problem is decomposed into a number of independent, 
“easier” subproblems on [0,T] that can be solved iteratively, and in parallel. 

At each iteration the nonlinear term is treated as source term, which gives us a 
linear system of ODEs that can be integrated in parallel, see Section 2.2. In the case 
of nonlinear problems, we take the average of the Jacobian matrix on each subinterval, 
which gives us the piecewise constant function for j = 1,..., P: 


Jk{t) := 


1 

Tj — Tj—i 



dgk dgk 
dui dUn 


dt, 


t £ [Tj-i, Tj]. 


(2.19) 


The IVP is then shifted with respect to the initial condition. 


= [^ + Jk{t)]uk+i{t) -\-gk{t), t£ [o,r], 

Ufe+i(0) = 0, 


( 2 . 20 ) 


where. 


gfe(t) := Auo + gfc(t) + Jk{t)[uo - Ufc(t)]. (2.21) 


The shifted IVP (2.20) can be then solved in parallel. We follow an approach similar 
to the Paraexp method, as explained in Section 2.2. Here, we define 


SjAt) 


gk{t), for <t <Tj, 
0, otherwise. 


( 2 . 22 ) 


Now, the problem is decomposed into P independent subproblems. The subsolutions 
Vj result from convergence of obeying: 

Vyfc+i(f) = [A + Jfe(t)]vj_fc+i(t) +gj,fc(t), t G [0,T], (2.23) 

v,-fe+i(0) = 0, (2.24) 


which can be integrated with the EBK method. Here, we have introduced a second 
subindex k to indicate the jth subsolution at the fcth iteration. The total solution 
can then be updated according to the principle of superposition, 

p 

Ufe+i(t) = uo + ^ Vj,fc+i(t). (2.25) 

1=1 

After each iteration, the solution is assembled, and the time-averaged Jacobian matrix 
is updated at each subinterval, according to Eq. (2.19). Note that for linear systems 
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of ODEs, no iterations are required at all. The parallel algorithm is summarized in 
Fig. 2.5. 

The parallel effiency of the EBK method is achieved by incorporating the paral¬ 
lelization within the iterations that are required to solve nonlinear IVPs. The differ¬ 
ences between the serial and the parallel algorithm are illustrated as flow diagrams in 
Figures 2.3 and 2.4. 



Figure 2.3: Flow diagram of the serial EBK method for nonlinear PDFs. The algo¬ 
rithm stops after K iterations, after which the solution is assumed to be converged. 


We consider the parallel efficiency in an idealized setting, and estimate a theoret¬ 
ical upper bound here, assuming that communication among the processors can be 
carried out very efficiently. The computational cost can then be simply estimated by 
the number of iterations required for the numerical solution to converge. Suppose the 
computation time of a single EBK iteration is tq. The maximum parallel speedup is 
then, 

, KiTo KiP 

where Ki is number of iterations for serial time integration, and Kp for parallel time 
integration. The theoretical upper bound of the parallel efficiency is then, 

efficiency = 2 ^^ = (2.26) 

High parallel efficiency can be achieved if the parallelization does not slow down the 
convergence of the numerical solution. As will be demonstrated in Section 4, we 
typically observe that Kp is not significantly larger than iAi, and a near-optimal ef¬ 
ficiency is achieved in various relevant cases. For comparison, the parallel efficiency 
of the Parareal algorithm is formally bounded by 1/Kp [39]. In our case, this upper 
bound is improved by a factor Ki. Parareal is an iterative method for the paralleliza¬ 
tion of sequential time integration methods, whereas the EBK method, for nonlinear 
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Figure 2.4: Flow diagram of the Paraexp-EBK method for nonlinear PDFs. The 
algorithm stops after K iterations, after which the solution is assumed to be converged. 


problems, is an iterative method to start with, and its parallelization does not neces¬ 
sarily increase the total number of iterations. Note the PFASST method [17] has a 
similar estimate of parallel efficiency as the PEEK method. 


Algorithm parallel time integration. 

Given: A, Uq, g(t,u(t)), ... 

Solve: u'(t) = Au(t) -|- g(t, u(t)), u(0) = Uq. 

for A: = 1,..., AT 

for j = 1,... ,P (in parallel) 

Calculate time-averaged Jacobian matrix Jk{t), see (2.19). 

Solve nonhomogeneous part linearized ODE (2.24), for t G [Tj-i,Tj]. 
Solve homogeneous part linearized ODE (2.24), for t G [Tj,Tp]. 

end for 

Update solution Uk{t) following (2.25). 
end for 


Figure 2.5: The algorithm for solving nonlinear problems with the Paraexp exponen¬ 
tial block Krylov (PEEK) method, see also Fig. 2.4. 


3. Advection-diffusion equation. In this section, we present results of numer¬ 
ical tests where the space-discretized advection-diffusion equation is solved with the 
EEK method. We demonstrate the consistency and stability of the EEK method for 






























A TIME-PARALLEL EXPONENTIAL INTEGRATOR 


11 


the linear advection-diffusion equation, which is a PDE describing a large variety of 
transport phenomena [32]. The spatial discretization, using central finite differencing 
for sake of illustration, of the PDE yields a linear system of ODEs. The time inte¬ 
gration can be parallelized as described in Section 2.2. We illustrate the principle of 
parallelization for different physical parameters of the advection-diffusion equation, 
before we move on to nonlinear PDEs in Section 4. Convergence of the numerical 
solution is observed for different values of the physical parameters. 

In our implementation of the EBK method, the IVP projected onto the block 
Krylov subspace is solved with the function odel5s in MATLAB (2013b). The rela¬ 
tive tolerance of the PEBK solver is denoted by tol. In our tests, the block Krylov 
subspace is restarted every 20 iterations. For the truncated SVD approximation (2.4), 
p(t) are chosen to be piecewise cubic polynomials, although other types of approxi¬ 
mations are possible as well, and are not crucial for the performance of the PEBK 
method. The sample points per subinterval are Chebyshev nodes. This is not crucial 
but gives a slightly better approximation than with uniform sample points. 

3.1. Homogeneous PDE. We consider the advection-diffusion equation with 
a short pulse initial condition, to clearly distinguish the seperate effects of advection 
and diffusion. The PDE and the periodic boundary conditions are as follows 


Ut + aux = i^Uxx, X G [0,1], t G [0,1], 
u{x, 0) = sin^° (ttx) , 
w(0, t) = u(l, t), 


(3.1) 


where a G K. is the advection velocity, and z/ G M the diffusivity coefficient. Both 
parameters are constant in space and time. The PDE is first discretized in space with 
a second-order central finite difference scheme [41]. The corresponding semi-discrete 
system of ODEs is then 


v'(t) = Av(t) -I- Auo, 
v(0) = 0, 


(3.2) 


where the matrix A results from the discretization of the spatial derivatives, and 
represents both the diffusive and the advective term. In (3.2), the substitution u = 
V -|- Uq has been applied, with u(t) being the vector function containing the values of 
the numerical solution on the mesh at time t, with Uq = u(0). The substitution leads 
to homogeneous initial conditions. In this case, the source term is constant in time, 
and its SVD polynomial approximation (2.4) is exact with m = 1. This allows us to 
focus on the two remaining parameters: the grid resolution and the tolerance of the 
EBK solver. 

The linear IVP (3.2) is decoupled into independent subproblems by partitioning 
of the source term (Auq) on the time interval. The superposition of the subsolutions is 
illustrated in Fig. 3.1, in which the time interval of interest, [0,1], has been partitioned 
into four equal subintervals, to be integrated on four processors. The sum of the initial 
condition and the subsolutions gives the final solution on the entire time interval, see 
Section 2.2. 

In the following numerical experiments, P = 8 subintervals are used. The 
advection-diffusion equation is solved for three different combinations of a and iz, 
see Fig. 3.2. The solutions are computed for different tol and Ax. The discretization 
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error is controlled by the time integration with tol, and by the spatial discretization 
with Ax. The error of the numerical solution is measured in the relative ^^-norm, 

||u,(T)-u(r)||/||u(r)||, (3.3) 

where the exact solution u(T) is on the mesh, i.e., it has the entries Uj{T) = u{xj,T). 
The exact solution of Eq. (3.1) is 

20 

u{x, t) = a„ cos {mr{x — at)) exp (^— (mr)^ I't'j , (3.4) 

n=0 


with coefficients 




sin^°(7rx) dx, 


= j sin^°(7rx) cos(n7rx) dx. 


(3.5) 

(3.6) 


The coefficients are calculated using numerical quadrature with high precision. The 
error in (3.3) shows second-order convergence with Ax, as expected from the trun¬ 
cation error of the finite difference scheme. The convergence plots show that we are 
able to control the error of the parallel time integration method. In this case, the final 
error can be made to depend only on the spatial resolution and the tolerance of the 
EBK method. Also, the EBK method has no principal restrictions on the timestep 
size, as it directly approximates the exact solution of Eq. (3.2) by Krylov subspace 
projections. 
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Figure 3.1: Superposition of subsolutions to the advection-diffusion equation, with 
partitioning T = {0,0.25,0.5, 0.75,1}, Ax = 5 • 10“^, a = 1, and v = 10“^. The color 
depends on the time coordinate: blue corresponds to t = 0 and green to t = 1. 
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(a) Velocity o = 0 and diffusivity coefficient v = 10 ^ 




(b) Velocity a = 1 and diffusivity coefficient v — 0. 




(c) Velocity a = 1 and diffusivity coefficient u = 10 ^ 


Figure 3.2: Left: numerical solution on the finest mesh at t = 0 (dashed) and t = 0.25 
(solid). Note that the solution for a = 1 and i/ = 0 does not show odd-even decoupling 
because of the high spatial resolution used. Right: convergence of the numerical 
solution (at final time t = 1), for tol: □ 10“^, o 10“^, x 10“®. 
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3.2. Parallel efficiency. In the previous examples, we solved homogeneous 
IVPs. Nonhomogeneous problems are generally more expensive to solve, because 
an accurate SVD approximation of the source term (2.4) requires more singular val¬ 
ues, which increases the block width of the block Krylov subspace (2.7). Parallel 
speedup can then be expected by splitting the nonhomogeneous problem into sub¬ 
problems (2.12), which require less individual effort to solve by the PEEK method. 
To measure the parallel efhciency of our algorithm for nonhomogeneous IVPs, we 
introduce a source term f(x,t) in the advection-diffusion equation 


ut +aux = (3.7) 

with V = 10“^, and a = 1. The source term is chosen such that the solution is a series 
of five travelling pulses 


u{x, t) = \ — \ cos (107r(a; — t)). (3.8) 

The mesh width of the spatial discretization is Ax = 10“^, such that the error due to 
the spatial discretization is small compared to the time integration error. The mesh 
width gives a semi-discrete system with a 1000 x 1000 matrix, which is a suitable 
problem size for testing the EBK method. The tolerance of the EBK method is set 
to 10“"^. The SVD polynomial approximation is constructed from 100 time samples 
per subinterval. The singular values are plotted in Fig. 3.4, which reveals that the 
first two singular values are several orders larger than the rest. Therefore, we retain 
only the first two singular values in the truncated SVD. The decay of singular values 
is guaranteed by the upper bound from Theorem 2.1. We have verified that the SVD 
approximation is sufficiently accurate, i.e., the approximation error, measured in the 
L^-norm, is less than the tolerance of the EBK method. 

We compare the parallel efhciency of the Paraexp-EBK implementation with a 
convential implementation of the Paraexp algorithm [21], where the nonhomogeneous 
problem is integrated with the Crank-Nicolson (CN) scheme. The linear system is 
solved directly using MATLAB. The matrix exponential propagator, for the homoge¬ 
neous problem in the Paraexp method, is realized with an Arnoldi method using the 
shift-and-invert technique [51]. 

In order to have a Courant number of one, we take At = 10“^ (for Ax = 10“^ and 
a = 1). According to the Paraexp method, the time step size needs to be decreased in 
parallel computation by a factor [21], where q is the order of the time integration 

method, in order to control the error. In case of the Crank-Nicolson scheme, we have 
q = 2. 

As discussed in Section 2.2, there is no communication required between pro¬ 
cessors, except for the superposition of the solutions to the subproblems at the end. 
Therefore, we are able to test the parallel algorithm on a serial computer by measur¬ 
ing the computation time of each independent subproblem. The computation time of 
the serial time integration is denoted tq. For the parallel time integration, we mea¬ 
sure the computation times of the nonhomogeneous and the homogeneous part of the 
subproblems separately, denoted by ri and T 2 respectively. The parallel speedup can 
then be estimated as 


speedup 


_A)_ 

max(ri) -|- max(r 2 ) ’ 


(3.9) 


where we take the maximum value of ti and T 2 over all parallel processes. The timings 
of the EBK method and Paraexp are listed in Table 3.1. 
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P 

Figure 3.3: Parallel efficiency for solv¬ 
ing the advection-diffusion equation. 
□ PEEK; X Paraexp/CN. 



m 


Figure 3.4: Singular values of the matrix 
composed of the source term samples. 


The parallel efficiency of both methods is illustrated in Figure 3.3. Note that 
the parallel efficiency of the standard Paraexp method steadily decreases, whereas 
the PEEK method maintains a constant efficiency level around 90%. The decrease 
in efficiency of the standard Paraexp method is due to the reduced time step size in 
the Crank-Nicolson scheme, with respect to its serial implementation. The nonho- 
mogeneous part of the subproblems requires more computation time as the number 
of processors increases. 

The numerical test confirms the initial assumption that parallel speedup with 
the EEK method can be achieved by decomposing the originial problem into simpler 
independent subproblems (2.12). Furthermore, the Paraexp implementation using the 
EEK method, appears to be more efficient than one using a traditional time-stepping 
method. 
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Table 3.1: Parallel effiency of the Paraxp-EBK method and the Paraexp/Crank- 
Nicolson method for the advection-diffusion equation, with number of processors P. 
Timing tq corresponds to the serial algorithm. For the parallel algorithm, ti and T 2 
are timings of the nonhomogeneous and homogeneous problem respectively. 




Serial 


Parallel 




P 

To 

Error 

max (ti ) 

max (t 2 ) 

Error 

Efficiency 

PEEK 

2 

1.43e+00 

3.05e-04 

7.22e-01 

6.44e-02 

2.70e-04 

91 % 


4 

2.81e+00 

3.05e-04 

7.23e-01 

7.06e-02 

2.68e-04 

89 % 


8 

5.59e+00 

3.05e-04 

7.40e-01 

6.21e-02 

2.88e-04 

87 % 


16 

1.14e+01 

3.05e-04 

7.30e-01 

6.28e-02 

3.22e-04 

90 % 


32 

2.27e+01 

3.05e-04 

7.33e-01 

6.63e-02 

3.65e-04 

89 % 

Paraexp 

2 

2.15e+00 

4.56e-04 

1.29e+00 

1.66e-02 

4.10e-04 

82 % 


4 

4.37e+00 

4.56e-04 

1.52e+00 

1.28e-02 

3.80e-04 

71 % 


8 

8.56e+00 

4.56e-04 

1.85e+00 

1.24e-02 

3.59e-04 

58 % 


16 

1.70e+01 

4.56e-04 

2.18e+00 

1.31e-02 

3.43e-04 

49 % 


32 

3.43e+01 

4.56e-04 

2.62e+00 

1.16e-02 

3.32e-04 

41 % 
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4. Burgers equation. In the previous section, the PEEK method was applied 
to a linear PDE. In this section, the performance of the PEEK method is tested on a 
nonlinear PDE, the viscous Eurgers equation. 

4.1. Travelling wave. Consider the viscous Eurgers equation. 


ut + uua; = i^Uxx + a; e [0, l],t e [0,1], (4.1) 

where v G M. denotes the diffusivity (or viscosity) coefficient. In the following ex¬ 
periments, we take u = 10“^, such that the problem is dominated by the nonlinear 
convective term. As in the previous example, the boundary conditions are periodic. 
Exact solutions to the Eurgers equation can be found by the Cole-Hopf transfor¬ 
mation [27]. In this test case, we construct a desired solution by introducing an 
appropriate source term, as shown for the advection-diffusion equation in Section 3.2. 
The the source term balances the dissipation of energy due to diffusion, and prevents 
the solution of vanishing in the limit t —^ oo. 

The Eurgers equation is an important equation in fluid dynamics. It can be 
seen as a simplification of the Navier-Stokes equation, which retains the nonlinear 
convective term, uux- In the limit —>■ 0, the nonlinearity may produce discontinuous 
solutions, or shocks, so that a typical solution may resemble a sawtooth wave. A 
sawtooth wave can be represented by an infinite Fourier series. A smooth version of 
the sawtooth wave can be obtained by the modified Fourier series. 


1 


1 


"^max 

m(^) = ^ ^ ^sin(27rfc^)$(fc,e), 


(4.2) 


where k is the wavenumber, /cmax a cutoff wavenumber, and 4>(fc, e) is a smoothing 
function, which supresses the amplitudes associated to high wavenumbers: 


$(/c,e) 


T:ke/2 

sinh(7rfce/2) ’ 


(4.3) 


Here, e is the smoothing parameter. The smoothing function $(A:, e) is motivated by 
the viscosity-dependent inertial spectrum of the Eurgers equation found by Chorin & 
Hald [11]. The smoothing function for e = 0.1 is shown in Figure 4.1a. As the smooth¬ 
ing function rapidly decreases with wavenumber, we choose a cutoff wavenumber of 
^max = 100. The value e = 0.1 is found to produce a smooth version of a sawtooth 
wave, as shown in Figure 4.1b. 

We consider a wave travelling in the positive x-direction by introducing the 
parametrization ^ = x — ^ in (4.1). The source term is then readily found 

by substituting the chosen solution (4.2) into the Eurgers equation (4.1), 


f{x, t) = Ut+ UUx - VUxx- 


(4.4) 


The space derivatives in the Eurgers equation are discretized with a second-order 
central finite difference schemes, using a uniform mesh width Aa; > 0. The error of 
the numerical solution is again measured in the relative £^-norm, cf. (3.3). 

The tolerance of the PEEK method is set to 10“'*. The SVD approximation of the 
source term, which includes the nonlinear term, is constructed from s = 50 samples 
per subinterval, and reveals that m = 12 singular values are sufficient in the truncated 
SVD, so that the error of the SVD approximation is less than the tolerance of the 
PEEK method. 
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(a) Smoothing function (solid line) and cut- (b) Solution for e = 0.1 and kmax = 100. 
off wavenumber (dashed line). 

Figure 4.1: Manufactured solution (4.2) of the Burgers equation. 


The nonlinear system of ODEs is solved iteratively, as outlined in Section 2.3. 
Figure 4.2 shows the error history at different grid resolutions. Here, the error con¬ 
verges to a value that depends on the mesh width. In other words, the final error 
of the time integration method is much smaller than the error due to the spatial 
discretization. 


10-1 


S 10-2 

p 

0 ) 

CM 


10-3 


Figure 4.2: Error history with AT = 0.1. □, Ax = 10 2; o. Ax = 5-10 3; A, 
Ax = 2.5-10-3. 



The PEEK method is parallelized as discussed in Section 2.4. The time interval 
[0,T] can be partitioned into P subintervals with a uniform subinterval size AT. In 
the following experiment, the complete time interval is extended with each subinterval 
added, T = PAT. The mesh width is Ax = 2.5 • 10-3. Figure 4.3a shows that 
increasing the number of processors and hence the simulated time, generally increases 
the number of iterations required to achieve the same level of accuracy. The increased 
number of required iterations implies a decrease in theoretical parallel efficiency, which 
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can be estimated by the ratio Ki/Kp, see (2.26). 

Next, the final time is kept constant, and the subinterval size reduces with an 
increase in the number of processors, AT = T /P. Figure 4.3b shows that the number 
of iterations is roughly independent of the number of processors, i.e., in this case, 
the parallel efficiency, Ki/Kp, does not decrease with P. Parallel speedup might 
also be improved by the fact that the SVD approximation converges faster on smaller 
subintervals, see Theorem 2.1. That is, fewer singular values have to be retained in the 
truncated SVD in order to achieve a certain accuracy. Also, the number of samples 
of the source term could be decreased on smaller subintervals. 


10-1 

S 

ho- 

o 

0 ? 

<1 

10-3 

2 4 6 8 10 

Iteration 

(a) AT = 0.1, T = PAT. (b) AT = T/P, T = 0.1. 

Figure 4.3: Error history for the Paraxp-EBK method for a fixed subinterval size and 
a hxed final time. □, T = 1; o, P = 2; A, P = 4; *, P = 8; o, P = 16; x , T = 32. 




4.2. Parallel efficiency. In this section, we test the parallel efficiency of the 
PEEK method for the viscous Burgers equation. In the following experiments, the 
source term is chosen such that the solution of the spatially discretized system is 
equal to the exact solution described in Section 4.1. In other words, the source term 
accounts for the error by the spatial discretization, and we measure only the error 
due to the time integration. The spatial discretization is here performed with a mesh 
width of Ax = 2 • IQ-^. 

If the subinterval size, AT, is fixed, the number of iterations generally increases 
in case the number of processors P increases, see Fig. 4.3. Increasing the number of 
iterations would reduce the parallel efficiency, according to (2.26). We therefore fix 
the final time T, such that the subinterval size decreases with increasing P. Also, the 
total number of samples over [0,T] is fixed, such that the local number of samples 
decreases with increasing P. To keep the global distribution of samples constant in 
the parallel computations, the local sample points are uniformly distributed instead 
of Chebyshev points as used in previous experiments. In our experiments, we use 
AT = 0.2/P and s = 128/P. Furthermore, the number of (retained) singular values 
is m = 12, and the tolerance of the EBK method is set to IQ-^. The error history for 
different P is shown in Fig. 4.4. The results show that the convergence rate of the 
PEEK method can improve by increasing P. The nonlinear corrections, see (2.18), 
appear to be more effective on smaller subintervals. 
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As opposed to the linear problem in Section 3.2, communication between the 
parallel processes is here required because of the waveform relaxation method. The 
parallel communication overhead is assumed to be very small, such that the parallel 
computations can be emulated on a serial computer. Per iteration of the PEEK 
method, the computation time of each individual process is measured, after which 
the maximum is stored, i.e., the computation time of the slowest process. The total 
computation time of the PEEK method is then taken as the sum, over the total 
number of iterations performed, of the maxima. In this experiment, we have measured 
a total of ten PEEK iterations. 

The computation times for v — 10“^ and v = 10“^ are shown in Fig. 4.5a, 
which clearly illustrate a parallel speedup. The slightly higher timings of ^ = 10“^ 
could be attributed to the increased stiffness of the problem. Figure 4.5b shows that 
the parallel efficiency of the PEEK method steadily decreases with the number of 
(virtual) processors, P. There is no significant difference between the performance 
of the method at these two values of the viscosity coefficient. The parallel efficiency 
might even be further tuned in practice, based on the previous observation that the 
required number of iterations could decrease with higher P. Also, the number of 
singular values, m, could possibly be reduced on smaller subintervals, based on (2.1), 
which would further enhance the potential parallel speedup. 

The PFASST algorithm [17] shows a good parallel efficiency for the viscous Eurg- 
ers equation as well. It is unknown whether changing the viscosity coefficient has an 
impact on the performance of PFASST. Our observations point in the direction of 
a good parallel efficiency of the PEEK method for simulations of the Navier-Stokes 
equation at high Reynolds numbers. For these problems there is a high demand for 
highly efficient parallel solvers [53]. The Parareal algorithm was for example reported 
to perform poorly in advection-dominated problems [18, 49]. 



Figure 4.4: Error history for the Paraxp-EEK method with AT = 0.2/P. □, P = 1; 
o, P = 2; A, P = 4; *, P = 8. 


4.3. Multiscale example. Turbulent flow is a multiscale phenomenon that is 
characterized by a wide range of length and time scales. In case of the Eurgers 
equation, we can emulate such a solution by combining two functions that are periodic 
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12 4 8 

P 


(a) X: V = 10~^; □: v = 10“^; dotted line: 
V = 10“^ (ideal); dashed line: u = 10“^ 
(ideal). 



Figure 4.5: Total computation time (left) and parallel efficiency (right) of ten PEEK 
iterations, with AT = 0.2/P. 


in both space and time, e.g., 

u{x, t) = sin(27ra:) sin(27rt) + ^ sin(2fco7rx) sin(2fco7r<), fcp > 1. (4.5) 

fco 

The solution features a large scale mode with wavenumber one, and a smaller scale 
mode with wavenumber > \. This combination allows the construction of an 
arbitrarily wide dynamic range. The factor 1/fco is included in compliance with an 
assumed energy distribution of [u{k)f‘ oc [11], where u{k) is the Fourier transform, 

/ OO 

u{x,t) exp{—2TTik{x + t)) dx dt, (4-6) 

-OO 

where we have used the fact that the solution is periodic in space and time. The 
previous experiment, see Fig. 4.3a, is repeated for the manufactured solution (4.5) 
with different values of fco- Figure 4.6 illustrates that widening the spectrum does not 
significantly affect the convergence of the PEEK iterations. Remarkably, the curves 
appear to form pairs based on P. 

5. Conclusions. We propose an implementation of the Paraexp method with 
enhanced parallelism based on the exponential block Krylov (EEK) method. Fur¬ 
thermore, the method, Paraexp-EEK (PEEK), is extended to solve nonlinear PDEs 
iteratively by a waveform relaxation method. The nonlinear terms are represented 
by a source term in a nonhomogeneous linear system of ODEs. Each iteration the 
source term is updated with the latest solution. The convergence of the iterative 
process can be accelerated by adding a correction term based on the Jacobian matrix 
of the nonlinear term. Each iteration the initial value problem can then be decoupled 
into independent subproblems, which can be solved parallel in time. Essentially, we 
implement the Paraexp method within a waveform relaxation approach in order to in¬ 
tegrate nonlinear PDEs. Also, the Paraxp-EEK (PEEK) method is used to integrate 
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Iteration 


(a) fco = 4 


(b) ko = 16 


Figure 4.6: Error history with AT = 0.1 and T = PAT. □, P = 1; o, P = 2; A, 
P = 4; *, P = 8; o, P = 16; X, P = 32. 


both the homogeneous and the nonhomogeneous parts of the subproblems. This is in 
contrast to the original Paraexp method, which assumes a convential time integration 
method for the nonhomogeneous parts. 

The PEEK method is tested on the advection-diffusion equation for which we 
demonstrate the parallelization concept for linear PDEs. The parallelization also 
works in cases without diffusion present, in which the PDE is purely hyperbolic. The 
parallel efficiency is compared with a Crank-Nicolson (CN) scheme parallelized with 
the Paraexp algorithm. The parallel efficiency of the PEEK method remains roughly 
constant around 90%. On the other hand, the parallel efficiency of the CN/Paraexp 
combination steadily decreases with the number of processors. 

As a model nonlinear PDE, the viscous Eurgers equation is solved. The number 
of waveform relaxation iterations required for a certain error tolerance increases, when 
the relative importance of nonlinearity grows by decreasing the viscosity coefficient. 
Good parallel efficiency of the EEK method was observed for different values of the 
viscosity coefficient. Since the nonlinear convective term in the Eurgers equation is 
shared by the Navier-Stokes equation, the presented results give a hint of the poten¬ 
tial of the PEEK method as an efficient parallel solver in turbulent fluid dynamics, 
where nonlinearity plays a key role. The question remains how to treat the pressure 
in the incompressible Navier-Stokes equation, which enforces the incompressibility 
constraint on the velocity field (see for possible approaches in [15, 43]). This will be 
explored in future work. 
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